Propagate scoring function through random sampler #116957

jan-elastic · 2024-11-18T15:32:10Z

fixes: #110134

.../elasticsearch/search/aggregations/bucket/sampler/random/RandomSamplerAggregatorFactory.java

elasticsearchmachine · 2024-11-18T15:37:53Z

Pinging @elastic/ml-core (Team:ML)

elasticsearchmachine · 2024-11-18T15:37:54Z

Hi @jan-elastic, I've created a changelog YAML for you.

jan-elastic · 2024-11-18T19:29:52Z

Example request / response for the fix:

POST /test-index/_search
{
  "query": {
    "function_score": {
      "random_score": {
      }
    }
  },
  "aggs": {
    "random_sampler": {
      "random_sampler": {
        "probability": 0.5
      },
      "aggs": {
        "samples": {
          "top_hits": {
            "size": 2,
            "_source": [
              "text"
            ]
          }
        }
      }
    }
  }
}

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.96045524,
    "hits": [
      {
        "_index": "test-index",
        "_id": "ZjX1G5MBaC3tcnySK34-",
        "_score": 0.96045524,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZTXrG5MBaC3tcnySAH4a",
        "_score": 0.82856405,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "aDX1G5MBaC3tcnySK34-",
        "_score": 0.33547562,
        "_source": {
          "text": [
            "text3",
            "text4"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "ZDXrG5MBaC3tcnySAH4Z",
        "_score": 0.14092654,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZzX1G5MBaC3tcnySK34-",
        "_score": 0.031006515,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "random_sampler": {
      "seed": -820777490,
      "probability": 0.5,
      "doc_count": 2,
      "samples": {
        "hits": {
          "total": {
            "value": 2,
            "relation": "eq"
          },
          "max_score": 0.96045524,
          "hits": [
            {
              "_index": "test-index",
              "_id": "ZjX1G5MBaC3tcnySK34-",
              "_score": 0.96045524,
              "_source": {
                "text": "text1"
              }
            },
            {
              "_index": "test-index",
              "_id": "ZDXrG5MBaC3tcnySAH4Z",
              "_score": 0.14092654,
              "_source": {
                "text": "text1"
              }
            }
          ]
        }
      }
    }
  }
}

Without the `random_score` you get similar results, all with scores of `1.0`.
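
One way to sanity-check a response like the one above is to compare each sampled hit's `_score` with the score the same document received in the top-level `hits`. The sketch below is illustrative only: `sampled_scores_match` is a hypothetical helper, not part of Elasticsearch or this PR. The response is trimmed to `_id` and `_score`, and the aggregation names (`random_sampler`, `samples`) are assumed to match the example request. It also assumes every sampled document appears in the top-level hits page, which holds here because the index has only five documents.

```python
def sampled_scores_match(response, sampler="random_sampler", top_hits="samples"):
    """Return True if each sampled hit has the same _score as the matching query hit."""
    query_scores = {hit["_id"]: hit["_score"] for hit in response["hits"]["hits"]}
    sampled_hits = response["aggregations"][sampler][top_hits]["hits"]["hits"]
    # A sampled document missing from the top-level page counts as a mismatch here.
    return all(query_scores.get(hit["_id"]) == hit["_score"] for hit in sampled_hits)


# Trimmed copy of the response above: only _id and _score are kept.
response = {
    "hits": {"hits": [
        {"_id": "ZjX1G5MBaC3tcnySK34-", "_score": 0.96045524},
        {"_id": "ZTXrG5MBaC3tcnySAH4a", "_score": 0.82856405},
        {"_id": "aDX1G5MBaC3tcnySK34-", "_score": 0.33547562},
        {"_id": "ZDXrG5MBaC3tcnySAH4Z", "_score": 0.14092654},
        {"_id": "ZzX1G5MBaC3tcnySK34-", "_score": 0.031006515},
    ]},
    "aggregations": {"random_sampler": {"samples": {"hits": {"hits": [
        {"_id": "ZjX1G5MBaC3tcnySK34-", "_score": 0.96045524},
        {"_id": "ZDXrG5MBaC3tcnySAH4Z", "_score": 0.14092654},
    ]}}}},
}

print(sampled_scores_match(response))  # prints True
```

If a sampled hit carried a score that differed from its query score, the helper would return `False`.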

benwtrent · 2024-11-18T20:01:12Z

@jan-elastic I wonder how this behaves with multiple layers of nesting? I think the original issue that @dgieselaar ran into had multiple layers of aggregations (random_sampler -> categorize_text -> top_hits).

...ava/org/elasticsearch/search/aggregations/bucket/sampler/random/RandomSamplerAggregator.java

jan-elastic · 2024-11-19T08:27:51Z

Regarding multiple layers of nesting: looks like it works fine, @benwtrent.

{
  "query": {
    "function_score": {
      "random_score": {
      }
    }
  },
  "aggs": {
    "random_sampler": {
      "random_sampler": {
        "probability": 0.5
      },
      "aggs": {
        "message": {
          "categorize_text": {
            "size": 6,
            "field": "text"
          },
          "aggs": {
            "samples": {
              "top_hits": {
                "size": 1,
                "_source": [
                  "text"
                ]
              }
            }
          }
        }
      }
    }
  }
}

{
  "took": 14,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 5,
      "relation": "eq"
    },
    "max_score": 0.8408959,
    "hits": [
      {
        "_index": "test-index",
        "_id": "ZjX1G5MBaC3tcnySK34-",
        "_score": 0.8408959,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZzX1G5MBaC3tcnySK34-",
        "_score": 0.72381663,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "ZDXrG5MBaC3tcnySAH4Z",
        "_score": 0.50730485,
        "_source": {
          "text": "text1"
        }
      },
      {
        "_index": "test-index",
        "_id": "ZTXrG5MBaC3tcnySAH4a",
        "_score": 0.24077094,
        "_source": {
          "text": [
            "text1",
            "text2"
          ]
        }
      },
      {
        "_index": "test-index",
        "_id": "aDX1G5MBaC3tcnySK34-",
        "_score": 0.17789865,
        "_source": {
          "text": [
            "text3",
            "text4"
          ]
        }
      }
    ]
  },
  "aggregations": {
    "random_sampler": {
      "seed": -685459259,
      "probability": 0.5,
      "doc_count": 2,
      "message": {
        "buckets": [
          {
            "doc_count": 4,
            "key": "text1",
            "regex": ".*?text1.*?",
            "max_matching_length": 5,
            "samples": {
              "hits": {
                "total": {
                  "value": 2,
                  "relation": "eq"
                },
                "max_score": 0.72381663,
                "hits": [
                  {
                    "_index": "test-index",
                    "_id": "ZzX1G5MBaC3tcnySK34-",
                    "_score": 0.72381663,
                    "_source": {
                      "text": [
                        "text1",
                        "text2"
                      ]
                    }
                  }
                ]
              }
            }
          },
          {
            "doc_count": 4,
            "key": "text2",
            "regex": ".*?text2.*?",
            "max_matching_length": 5,
            "samples": {
              "hits": {
                "total": {
                  "value": 2,
                  "relation": "eq"
                },
                "max_score": 0.72381663,
                "hits": [
                  {
                    "_index": "test-index",
                    "_id": "ZzX1G5MBaC3tcnySK34-",
                    "_score": 0.72381663,
                    "_source": {
                      "text": [
                        "text1",
                        "text2"
                      ]
                    }
                  }
                ]
              }
            }
          }
        ]
      }
    }
  }
}
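
For nested layouts such as random_sampler -> categorize_text -> top_hits, the same check can be done without hard-coding the aggregation path by walking the aggregation tree and collecting every `_score` found under a `hits.hits` list. This is a minimal sketch under stated assumptions: `collect_sampled_scores` is a hypothetical helper, and the data is a trimmed copy of the nested response above, keeping only the fields the helper reads.

```python
def collect_sampled_scores(node):
    """Recursively gather (_id, _score) pairs from every top_hits result under node."""
    found = []
    if isinstance(node, dict):
        hits = node.get("hits")
        if isinstance(hits, dict) and isinstance(hits.get("hits"), list):
            found.extend((hit["_id"], hit["_score"]) for hit in hits["hits"])
        for key, value in node.items():
            if key != "hits":  # do not descend into the hits themselves
                found.extend(collect_sampled_scores(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_sampled_scores(item))
    return found


# Trimmed copy of the nested response above.
query_scores = {
    "ZjX1G5MBaC3tcnySK34-": 0.8408959,
    "ZzX1G5MBaC3tcnySK34-": 0.72381663,
    "ZDXrG5MBaC3tcnySAH4Z": 0.50730485,
    "ZTXrG5MBaC3tcnySAH4a": 0.24077094,
    "aDX1G5MBaC3tcnySK34-": 0.17789865,
}
top_hit = {"hits": {"hits": [{"_id": "ZzX1G5MBaC3tcnySK34-", "_score": 0.72381663}]}}
aggregations = {"random_sampler": {"doc_count": 2, "message": {"buckets": [
    {"key": "text1", "samples": top_hit},
    {"key": "text2", "samples": top_hit},
]}}}

sampled = collect_sampled_scores(aggregations)
print(all(query_scores[doc_id] == score for doc_id, score in sampled))  # prints True
```

Both buckets return the same document here, so the walker yields two identical pairs, each matching that document's query score.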

benwtrent

I think for initial work, this is in a good place. Let's add some YAML testing and such.

...s/aggregations/src/yamlRestTest/resources/rest-api-spec/test/aggregations/random_sampler.yml

elasticsearchmachine · 2024-11-20T13:04:25Z

Hi @jan-elastic, I've updated the changelog YAML for you.

elasticsearchmachine · 2024-11-20T14:36:22Z

💔 Backport failed

Status	Branch	Result
❌	8.16	Commit could not be cherrypicked due to conflicts
✅	8.x

You can use sqren/backport to manually backport by running `backport --upstream elastic/elasticsearch --pr 116957`

* Propagate scoring function through random sampler.
* Update docs/changelog/116957.yaml
* Correct score mode in random sampler weight
* Fix random sampling with scores and p=1.0
* Unit test with scores
* YAML test
* Add capability

elasticsearchmachine added the needs:triage and v9.0.0 labels Nov 18, 2024

jan-elastic commented Nov 18, 2024

.../elasticsearch/search/aggregations/bucket/sampler/random/RandomSamplerAggregatorFactory.java (outdated)

jan-elastic commented Nov 18, 2024

.../elasticsearch/search/aggregations/bucket/sampler/random/RandomSamplerAggregatorFactory.java

jan-elastic requested a review from benwtrent November 18, 2024 15:36

jan-elastic added the >bug, :ml, Team:ML, and v8.17.0 labels and removed the needs:triage label Nov 18, 2024

benwtrent reviewed Nov 18, 2024

...ava/org/elasticsearch/search/aggregations/bucket/sampler/random/RandomSamplerAggregator.java (outdated)

jan-elastic force-pushed the propagate-scoring-random-sampler branch from b27e007 to cfe449a Compare November 19, 2024 13:20

benwtrent reviewed Nov 19, 2024

jan-elastic and others added 5 commits November 20, 2024 08:44

Propagate scoring function through random sampler.

b5d1e77

Update docs/changelog/116957.yaml

bd5f799

Correct score mode in random sampler weight

99f2083

Fix random sampling with scores and p=1.0

9891403

Unit test with scores

b09af6a

jan-elastic force-pushed the propagate-scoring-random-sampler branch from cfe449a to b09af6a Compare November 20, 2024 07:45

YAML test

38b4488

benwtrent reviewed Nov 20, 2024

...s/aggregations/src/yamlRestTest/resources/rest-api-spec/test/aggregations/random_sampler.yml (outdated)

benwtrent added the auto-backport and v8.16.1 labels Nov 20, 2024

Add capability

bbdc84e

jan-elastic force-pushed the propagate-scoring-random-sampler branch from 490fb1e to bbdc84e Compare November 20, 2024 13:23

benwtrent approved these changes Nov 20, 2024

jan-elastic merged commit dea1e7d into main Nov 20, 2024
17 checks passed

jan-elastic deleted the propagate-scoring-random-sampler branch November 20, 2024 14:34

jan-elastic mentioned this pull request Nov 20, 2024

[8.x] Propagate scoring function through random sampler (#116957) #117162

Merged

elasticsearchmachine added the backport pending label Nov 20, 2024

jan-elastic mentioned this pull request Nov 20, 2024

[8.16] Propagate scoring function through random sampler (#116957) #117165

Merged
